Unsupervised Dimension Reduction of High-Dimensional Data for Cluster Preservation

Authors

  • Enikö Szekely
  • Eric Bruno
  • Stéphane Marchand-Maillet
Abstract

High-dimensional data is receiving increasing attention in a growing number of application fields, but analysing such data has proven difficult because of the “curse of dimensionality”. Dimension reduction methods have emerged as successful tools for overcoming the problem of high dimensionality. However, even though they are designed to preserve the most important properties of the data, they are generally blind to the preservation of structures (e.g. multimodal distributions, clusters). In this paper, we propose a class of dimension reduction strategies, called High-Dimensional Multimodal Embedding (HDME), that aim to find low-dimensional representations of high-dimensional data that preserve cluster information. The difficulty of analysing high-dimensional data arises from the fact that, in high-dimensional representation spaces, all pairwise distances between points tend to become equal. To overcome this equidistance problem, HDME first processes the distances by scaling those between similar data points. Similarity may be estimated from neighbourhood, cluster or class information. We show that the neighbourhood-based variant is a competitive alternative to clustering. After the scaling, the points are embedded in a low-dimensional space using a distance-based embedding method. Experiments show that HDME is effective in terms of both retrieval and clustering when compared with known state-of-the-art methods operating in high-dimensional spaces. The code and data are available from http://viper.unige.ch/doku.php/viper_private:HDME.
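
The two-stage idea in the abstract (scale the distances between similar points, then apply a distance-based embedding) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scaling function or the embedding method, so the sketch assumes a constant shrink factor applied to k-nearest-neighbour pairs and uses classical MDS as the embedding. The function names and the parameters `k` and `factor` are hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix computed from the Gram matrix."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)      # clamp tiny negatives from round-off
    np.fill_diagonal(d2, 0.0)
    return np.sqrt(d2)


def scale_neighbour_distances(D, k=10, factor=0.1):
    """Shrink distances between neighbouring points.

    Assumption (not from the paper): a pair (i, j) counts as "similar"
    when j is among the k nearest neighbours of i, or i among those of j,
    and its distance is multiplied by a constant `factor` < 1.
    """
    n = D.shape[0]
    D_no_self = D.copy()
    np.fill_diagonal(D_no_self, np.inf)           # exclude self-matches
    neighbours = np.argsort(D_no_self, axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), neighbours.ravel()] = True
    mask = mask | mask.T                          # keep the matrix symmetric
    return np.where(mask, factor * D, D)


def classical_mds(D, n_components=2):
    """Classical MDS, used here as a stand-in distance-based embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[idx], 0.0, None)        # scaled distances may be non-Euclidean
    return eigvecs[:, idx] * np.sqrt(lam)


# Synthetic example: two Gaussian clusters in 100 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 100)),
    rng.normal(1.0, 1.0, size=(30, 100)),
])
D = pairwise_distances(X)
Y = classical_mds(scale_neighbour_distances(D, k=10, factor=0.1))
print(Y.shape)
```

The neighbourhood rule is only one of the three similarity sources the abstract mentions; replacing the mask with one built from cluster or class labels would give the other variants under the same assumptions.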

Similar resources

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

Improving the Nonlinear Manifold Discriminant Model for Face Recognition with One Image per Person

Manifold learning is a dimension reduction method for extracting nonlinear structures of high-dimensional data. Many methods have been introduced for this purpose. Most of these methods usually extract a global manifold for data. However, in many real-world problems, there is not only one global manifold, but also additional information about the objects is shared by a large number of manifolds...

Unsupervised Kernel Dimension Reduction

We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for uns...

Multilayer bootstrap network for unsupervised speaker recognition

We apply multilayer bootstrap network (MBN), a recently proposed unsupervised learning method, to unsupervised speaker recognition. The proposed method first extracts supervectors from an unsupervised universal background model, then reduces the dimension of the high-dimensional supervectors by multilayer bootstrap network, and finally conducts unsupervised speaker recognition by clustering the l...

Integrated constraint based clustering algorithm for high dimensional data

Dimension selection, dimension weighting and data assignment are three circularly dependent, essential tasks for high-dimensional data clustering, and each of them is challenging. To meet the challenge of high-dimensional data clustering, constraints have been employed in several previous works. However, these constraint-based algorithms use constraints to help accomplish only one of the three es...

Publication date: 2008